fix(tests): eliminate test-api shard cross-test contamination (vchord cache, tenant schemas, maintenance routine TOCTOU)#2310
Merged
Conversation
… stop cross-test contamination
The ANN tests in test_link_utils.py monkeypatch
HINDSIGHT_API_VECTOR_EXTENSION (e.g. to "vchord"). That env var is read
through the process-global config cache (get_config()), and monkeypatch
reverts only the env var on teardown — not the cache. Once get_config()
caches "vchord", it persists for the rest of the xdist worker.
Every subsequent bank-creating test on that worker then builds per-bank
vector indexes with `USING vchordrq` against the pgvector-only test DB
and fails with:
asyncpg.exceptions.UndefinedObjectError: access method "vchordrq" does not exist
cascading across dozens of unrelated tests in the test-api shard
(test_list_documents, test_maintenance_routines, test_mental_models,
test_observations, ...). Because the leak depends on which worker first
populates the cache, the failure looked like a flaky, shard-specific
infra problem.
Fix: add an autouse fixture to the class that clears the config cache
before and after each test, so the cache is rebuilt from the current
env per test and "vchord" can't leak out.
test_maintenance_multitenant provisions 100 tenant schemas by running
CREATE SCHEMA + 5×CREATE TABLE per schema. Each statement autocommitted,
so there was a window where a schema existed with only some of its
tables. The global maintenance routines (public.schemas_with_expired_rows
/ banks_needing_consolidation) discover schemas by table presence and are
exercised concurrently by test_maintenance_routines on another xdist
worker against the shared test DB. They would query a not-yet-created
table in a half-built schema and fail with:
asyncpg.exceptions.UndefinedTableError: relation "mt<hash>_NNN.memory_units" does not exist
Wrap the whole provisioning in a single transaction so the schemas
become visible to other connections only once fully built.
…utines
public.banks_needing_consolidation() and public.schemas_with_expired_rows()
snapshot the schemas owning a target table from pg_class, then run a dynamic
query against each schema in turn. That is a TOCTOU race: a schema (or its
tables) can be dropped between the snapshot and the per-schema query — a tenant
being deleted, a tenant migration recreating tables, or (in the test suite) the
multi-tenant maintenance test creating/dropping ~100 schemas concurrently with
test_maintenance_routines on the shared DB. The query then aborts the whole
routine with:
relation "<schema>.memory_units" does not exist
relation "<schema>.audit_log" does not exist
Forward migration c7e9f1a3b5d2 redefines both routines (CREATE OR REPLACE,
public/base-run gated, PG-only) so each per-schema query runs in its own
subtransaction that skips the schema on undefined_table / invalid_schema_name /
undefined_column instead of failing the scan.
Adds a deterministic regression test (schema with memory_units but no banks
table) for the skip path.
…op chunks-mode leak
test_memory_defense._make_minimal_engine() builds a MemoryEngine inside a
patch.dict that sets HINDSIGHT_API_LLM_PROVIDER=none. Constructing the engine
calls get_config(), repopulating the process-global config cache from the
patched env — and provider="none" forces retain_extraction_mode="chunks". When
patch.dict restores the env, the cache still holds the "none"/chunks config.
It then leaks to every later test on the same xdist worker: their retains run
in chunks mode (raw text, NO entity extraction), so unrelated assertions fail —
notably the test_observations entity tests ("John/Alice/Nexora entity should
exist"), which presented as a flaky, shard-specific failure (whichever entity
test landed on the poisoned worker).
Drop the config cache after the patched env is restored so the next get_config()
rebuilds from the real env. Reproduced deterministically:
pytest test_memory_defense.py::test_engine_memory_defense_shares_ext_ctx \
test_observations.py::test_entity_extraction_on_retain
# before: entity test FAILED (Insert unit_entities: 0 pairs)
# after: passed
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Fixes the recurring
test-apishard failures (it had been red across re-runs and blocking every code PR's CI signal). Three independent cross-test contamination / isolation bugs were stacked in thetest-api (2/3)shard, each surfacing as a deterministic-looking "infra flake". This PR fixes all three.1.
vchordconfig-cache leak (the dominant failure — dozens of tests)test_link_utils.py's ANN tests monkeypatchHINDSIGHT_API_VECTOR_EXTENSION(one setsvchord). That value is read through the process-global config cache (get_config()), andmonkeypatchreverts only the env var on teardown — not the cache. Onceget_config()cachesvchord, every later bank-creating test on that xdist worker builds per-bank indexes withUSING vchordrqagainst the pgvector-only test DB and fails with:Fix: an autouse fixture on
TestComputeSemanticLinksAnnPgBouncerSafetythat clears the config cache before/after each test (same pattern astest_worker_retry_knobs.py).2. Half-built tenant schemas in
test_maintenance_multitenantIt provisioned 100 tenant schemas with autocommitted
CREATE SCHEMA+ per-tableCREATE TABLE, leaving a window where a schema existed with only some tables.Fix: wrap provisioning in a single transaction so schemas appear atomically.
3. TOCTOU in the maintenance routines (latent prod bug)
public.banks_needing_consolidation()/public.schemas_with_expired_rows()snapshot the schemas owning a target table frompg_class, then run a dynamic query against each. A schema dropped between snapshot and query (a tenant being deleted/migrated, or the multi-tenant test's teardown running concurrently withtest_maintenance_routines) aborts the whole routine:Fix: forward migration
c7e9f1a3b5d2redefines both routines (CREATE OR REPLACE, public/base-run gated, PG-only) so each per-schema query runs in its own subtransaction that skips the schema onundefined_table/invalid_schema_name/undefined_column. Adds a deterministic regression test (schema withmemory_unitsbut nobankstable).Verification
vchordrepro flips 6 failed → 7 passed;test_link_utils.py40 passed.test_maintenance_routines+test_maintenance_multitenant) green; new resilience regression test passes.test_migration_shape.pygreen, applies cleanly (e1f2a3b4c5d6 → c7e9f1a3b5d2).ruff/tyclean (tests excluded from ruff lint per repo config).Note:
test-doc-examples (cli)failures seen on earlier runs are an unrelated Gemini real-LLM timeout /400 INVALID_ARGUMENTflake, not touched here.